welcoming learners
to data science
with the tidyverse
mine çetinkaya-rundel
duke university + posit
Focus:
Data science for new learners
Philosophy:
Let them eat cake (first)!
Assumption 1:
Teach authentic tools
Assumption 2:
Teach R as the authentic tool
The tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle.
Introduction to Data Science
The tidyverse is a meta R package that loads eight core packages when invoked and also bundles numerous other packages that share a design philosophy, common grammar, and data structures.
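To make "loads core packages when invoked" concrete, here is a minimal sketch, assuming the tidyverse package is installed. The variable names `all_pkgs` and `core_attached` are illustrative and not from the talk, and the exact set of attached packages may differ with the installed tidyverse version.

```r
# Attaching the meta package attaches the core packages in one call.
library(tidyverse)

# Every package bundled with the tidyverse (core and non-core).
all_pkgs <- tidyverse_packages()

# The subset currently attached to the search path after library(tidyverse).
core_attached <- intersect(paste0("package:", all_pkgs), search())

print(core_attached)
```

Bundled packages that are not attached this way are installed alongside the tidyverse but still need their own `library()` call.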
Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.
| Homeownership | Average loan amount | Number of applicants |
|---|---|---|
| Mortgage | $18,132 | 4,778 |
| Own | $15,665 | 1,350 |
| Rent | $14,396 | 3,848 |
Create side-by-side box plots that show the relationship between loan amount and application type based on homeownership.
# A tibble: 9,976 × 3
# Groups:   homeownership [3]
  loan_amount homeownership application_type
        <int> <chr>         <fct>
1       28000 Mortgage      individual
2        5000 Rent          individual
3        2000 Rent          individual
4       21600 Rent          individual
5       23000 Rent          joint
6        5000 Own           individual
# … with 9,970 more rows
Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.
[input] data frame
loans |>
  group_by(homeownership) |>
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
  ) |>
  arrange(desc(avg_loan_amount))

# A tibble: 3 × 3
  homeownership avg_loan_amount n_applicants
  <chr>                   <dbl>        <int>
1 Mortgage               18132.         4778
2 Own                    15665.         1350
3 Rent                   14396.         3848
[output] data frame
aggregate()
ns <- aggregate(
  loan_amount ~ homeownership,
  data = loans, FUN = length
)
names(ns)[2] <- "n_applicants"
avgs <- aggregate(
  loan_amount ~ homeownership,
  data = loans, FUN = mean
)
names(avgs)[2] <- "avg_loan_amount"
result <- merge(ns, avgs)
result[order(result$avg_loan_amount,
             decreasing = TRUE), ]

  homeownership n_applicants avg_loan_amount
1      Mortgage         4778        18132.45
2           Own         1350        15665.44
3          Rent         3848        14396.44
challenges: need to introduce
tapply()
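The slide text names tapply() without showing a call. Below is a minimal sketch of how the same summary might be computed with it. It assumes a toy data frame built from the six rows printed earlier, since the full `loans` data is not included here. The names `loans_toy`, `avgs`, `ns`, and `avgs_sorted` are illustrative and not from the talk.

```r
# Toy stand-in for the loans data: the six rows shown in the tibble preview.
loans_toy <- data.frame(
  loan_amount   = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

# tapply() applies a function to loan_amount within each homeownership level.
avgs <- tapply(loans_toy$loan_amount, loans_toy$homeownership, mean)
ns   <- tapply(loans_toy$loan_amount, loans_toy$homeownership, length)

# The result is a named one-dimensional array, not a data frame,
# so ordering is done with sort() rather than by arranging rows.
avgs_sorted <- sort(avgs, decreasing = TRUE)
print(avgs_sorted)
print(ns[names(avgs_sorted)])
```

Note that the averages and the counts come back as two separate arrays, so combining them into one table takes an extra step.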
challenges: need to introduce
apply() functions, array
boxplot()
levels <- sort(unique(loans$homeownership))
loans1 <- loans[loans$homeownership == levels[1], ]
loans2 <- loans[loans$homeownership == levels[2], ]
loans3 <- loans[loans$homeownership == levels[3], ]
par(mfrow = c(1, 3))
boxplot(loan_amount ~ application_type,
        data = loans1, main = levels[1])
boxplot(loan_amount ~ application_type,
        data = loans2, main = levels[2])
boxplot(loan_amount ~ application_type,
        data = loans3, main = levels[3])
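For comparison with the base R version above, here is a sketch of how the same side-by-side box plots might be drawn with ggplot2. The extracted slide text does not include this code, so it is an assumption about one reasonable tidyverse approach, not the talk's own solution. `loans_toy` is a hypothetical stand-in built from the six rows printed earlier.

```r
library(ggplot2)

# Toy stand-in for the loans data: the six rows shown in the tibble preview.
loans_toy <- data.frame(
  loan_amount      = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership    = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own"),
  application_type = factor(c("individual", "individual", "individual",
                              "individual", "joint", "individual"))
)

# One box plot of loan amount per application type,
# with one panel per homeownership level.
p <- ggplot(loans_toy, aes(x = application_type, y = loan_amount)) +
  geom_boxplot() +
  facet_wrap(~ homeownership)

print(p)
```

Here the faceting step takes the place of the manual subsetting and `par(mfrow = ...)` layout, so the number of panels follows the data instead of being hard-coded.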